Skip to content

feat(transcode): r2 fixes — typed Arrow + codec_route + partial writes + CachedOntology + route validation#310

Merged
AdaWorldAPI merged 1 commit into
mainfrom
claude/transcode-r2-fixes-L3DF0
Apr 30, 2026
Merged

feat(transcode): r2 fixes — typed Arrow + codec_route + partial writes + CachedOntology + route validation#310
AdaWorldAPI merged 1 commit into
mainfrom
claude/transcode-r2-fixes-L3DF0

Conversation

@AdaWorldAPI
Copy link
Copy Markdown
Owner

Summary

Round-2 follow-up to #309 addressing all five concrete gaps the brutally-honest review called out.

# Issue Fix
1 arrow_type_for_semantic collapsed every semantic type to Utf8 — DataFusion couldn't do WHERE amount > 1000.0 predicate pushdown CurrencyFloat32, Date(_)Date32, CustomerId / InvoiceNumberUInt64
2 No cached Ontology pattern — medcare-rs and smb-office-rs both grew their own *Ontology cache independently New CachedOntology upstream that bundles Arc<Ontology> + eagerly-projected DTOs per locale
3 No drift-route validation test — typo routes silently accepted New validate_route(route, ontology) -> Result<(), String> + 4 tests
4 from_columns requires every declared column — broke PATCH-style upserts New from_columns_partial allows missing Optional/Free; still rejects missing Required + undeclared
5 route_for_column used route_tensor (calibrated for model-weight tensor names) — silently mis-classified document predicates Now reads OuterColumn.codec_route directly, copied from the upstream PropertySpec.codec_route

Why each fix matters

#1 — Real predicate pushdown

The reviewer was right that this is the single biggest functional gap in #309. With Currency → Float32, MedCare's lab-result value column can finally take WHERE value > 5.0 through DataFusion's scan filter rather than a string compare. Three semantic types covered (matching the reviewer's "at minimum" suggestion); the rest stay Utf8 by deliberate choice — round 3 can pivot specific ones (e.g. Geo → struct{lat, lon}) when a consumer actually asks.

fn arrow_type_for_semantic(st: &SemanticType) -> ArrowTypeCode {
    match st {
        SemanticType::Currency(_) => ArrowTypeCode::Float32,
        SemanticType::Date(_) => ArrowTypeCode::Date32,
        SemanticType::CustomerId | SemanticType::InvoiceNumber => ArrowTypeCode::UInt64,
        _ => ArrowTypeCode::Utf8,
    }
}

#2CachedOntology upstream

Both consumer crates had to build their own caching wrapper. One canonical implementation here, with Arc<OntologyDto> clones that are pointer-equal across calls.

#5 — Drift-by-construction-impossible

This is the design correctness fix. PropertySpec already carries codec_route. Round-1's heuristic via route_tensor was calibrated for model-weight tensor names and would silently mis-classify document predicates. Round-2 reads the contract's own field. The transcode layer can never disagree with the schema author's intent.

pub fn route_for_column(col: &OuterColumn) -> CodecRoute {
    col.codec_route  // copied from PropertySpec.codec_route at derivation
}

#3 — Route-validation tests

4 tests covering: accept-valid (/api/patient/42), reject-typo (/api/patient_typo/42), reject-missing-prefix (/v1/patient/42), reject-empty (/api/).

#4 — Partial-row PATCH

Optional/Free columns may be omitted (filled with Arrow null arrays via arrow::array::new_null_array). Required columns still mandatory. Undeclared columns still rejected. The strict from_columns keeps its full-row semantics; from_columns_partial is the new entry point for PATCH callers.

What's still deferred

The reviewer listed two more items as round-2 lifts that this PR does not address:

  • Phase 4 (NARS cold sink) — not covered. Continues to be a future PR aligned with .claude/plans/sql-spo-ontology-bridge-v1.md.
  • Phase 5 (BindSpace → outer-DTO direction)zerocopy.rs still only goes outer → Arrow. The reverse path (BindSpace columns → external SoA batch) needs the producer-side accessor in cognitive-shader-driver first.

Both are honestly tagged in the original module docs and should land incrementally.

Test scoreboard

#309   transcode::      26 passed, 0 failed
#310   transcode::      33 passed, 0 failed   (+7 new + 1 rewritten)

Verification

  • cargo check: clean across 3 feature combos — default, query-lite, query-lite + audit-log + auth-rls-lite
  • cargo test transcode::: 33 passed, 0 failed
  • cargo clippy: zero transcode warnings across the same combos
  • rustfmt --check: clean across all 4 modified files

Files changed

  • crates/lance-graph-callcenter/src/transcode/mod.rs+CachedOntology + 3 new tests
  • crates/lance-graph-callcenter/src/transcode/zerocopy.rs — typed semantic-type → Arrow mapping; +from_columns_partial; +OwnedColumn::Date32; +OuterColumn.codec_route field
  • crates/lance-graph-callcenter/src/transcode/cam_pq_decode.rsroute_for_column reads OuterColumn.codec_route; rewritten test
  • crates/lance-graph-callcenter/src/transcode/parallelbetrieb.rs+validate_route + 4 new tests

Generated by Claude Code


Generated by Claude Code

                  partial writes + CachedOntology + route validation

Addresses the five concrete gaps the brutally-honest review on #309
called out:

  1. arrow_type_for_semantic no longer collapses everything to Utf8.
     Currency → Float32, Date(_) → Date32, CustomerId / InvoiceNumber
     → UInt64. DataFusion can now do real numeric / temporal predicate
     pushdown on those columns. The remaining semantic types stay
     Utf8 by deliberate choice — round 3 may pivot specific ones (Geo
     → struct{lat,lon}) when a consumer asks.

  2. CachedOntology helper extracted upstream. Bundles an Arc<Ontology>
     with eagerly-projected DTOs per Locale (De/En). Prevents the
     per-call OntologyDto::from_ontology rebuild that medcare-rs's
     MedcareOntology and smb-office-rs's session ontology both grew
     independently. One implementation, one bug surface.

  3. validate_route(route, ontology) added to parallelbetrieb. Parses
     /api/{entity_type}/{...} and asserts entity_type resolves to a
     declared Schema.name (case-insensitive). 4 tests covering
     accept-valid, reject-typo, reject-missing-prefix, reject-empty.
     Used as a pre-flight check for static route lists; the runner
     itself doesn't gate on it because typo routes are still genuine
     drift telemetry.

  4. from_columns_partial added for PATCH-style upserts. Allows
     omitted Optional / Free columns (filled with Arrow null arrays);
     still rejects missing Required columns and undeclared columns.
     The existing from_columns keeps its strict full-row contract.

  5. route_for_column now reads OuterColumn.codec_route directly,
     copied from the upstream PropertySpec.codec_route at schema
     derivation. Round-1 went through route_tensor which is
     calibrated for model-weight tensor names (q_proj, lm_head, ...)
     and would silently mis-classify document predicates. The
     contract's own field is now the source of truth — drift-by-
     construction impossible.

Adds 7 new tests (6 in transcode/ + 1 implicit in cam_pq_decode
where the existing test was rewritten to assert the new semantics):
  - cached_ontology_projects_every_locale_at_construction
  - cached_ontology_clones_are_arc_cheap
  - cached_ontology_inner_round_trips
  - validate_route_accepts_known_entity_type
  - validate_route_rejects_typo_entity_type
  - validate_route_rejects_missing_api_prefix
  - validate_route_rejects_empty_entity_segment
  - route_for_required_scalar_column_uses_property_spec_default
    (rewritten from route_for_scalar_columns_skips_codec; old test
    asserted the wrong thing — required scalars should default to
    Passthrough, not Skip)

Verified:
  cargo check across {default, query-lite, query-lite+audit-log+
    auth-rls-lite} — clean
  cargo test transcode:: → 33/33 passed (was 26/26 in #309; +7
    new + 1 rewritten)
  cargo clippy on the same combos — zero transcode warnings
  rustfmt --check — clean across all 4 modified files
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8ee50cc4d3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

fn arrow_type_for_semantic(st: &SemanticType) -> ArrowTypeCode {
match st {
SemanticType::Currency(_) => ArrowTypeCode::Float32,
SemanticType::Date(_) => ArrowTypeCode::Date32,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Map DateTime semantic types to timestamp-capable Arrow type

arrow_type_for_semantic now sends every SemanticType::Date(_) to Date32, but SemanticType::Date includes DatePrecision::DateTime; this collapses all intra-day values to whole days and makes time-of-day predicates impossible for those fields. Any ontology property marked as Date(DateTime) will either lose precision at write time or be forced into an incompatible column representation.

Useful? React with 👍 / 👎.

/// `OuterColumn.semantic_type` field.
fn arrow_type_for_semantic(st: &SemanticType) -> ArrowTypeCode {
match st {
SemanticType::Currency(_) => ArrowTypeCode::Float32,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve monetary precision for currency columns

Mapping SemanticType::Currency(_) to Float32 introduces rounding for financial amounts (especially at larger magnitudes), which can change cents-level values and produce unstable equality/aggregation behavior. This is a regression from lossless string transport and can corrupt financial semantics in consumers that rely on exact totals or exact-match filters.

Useful? React with 👍 / 👎.

@AdaWorldAPI AdaWorldAPI merged commit 8332bbc into main Apr 30, 2026
1 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant